Use weak references to refer to the original data in the accessor #880

freddyaboulton · 2021-05-03T14:14:55Z

As a user of woodwork, I noticed that the table and column accessors hold a strong reference to the original dataframe/series. This prevents the memory taken up by the original data from being freed once the user drops their last reference to it: the accessor always points to the original data, so the reference count never falls below 1. We should store a weak reference instead, so that the garbage collector can free the memory. See #881 for a proof of concept.
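To make the difference concrete, here is a minimal sketch that uses plain Python classes rather than pandas or woodwork. `Data`, `StrongAccessor`, `WeakAccessor` and `freed_by_refcount_alone` are hypothetical names made up for this illustration, and the sketch assumes the accessor is cached on the object it wraps, which is what creates the reference cycle. The behaviour shown relies on CPython's reference counting.

```python
import gc
import weakref


class Data:
    """Hypothetical stand-in for a dataframe that caches its accessor."""

    def __init__(self, values):
        self.values = values
        self.accessor = None


class StrongAccessor:
    """Keeps a strong reference to its owner, forming a cycle."""

    def __init__(self, data):
        self._data = data


class WeakAccessor:
    """Keeps only a weak reference to its owner, so no cycle is formed."""

    def __init__(self, data):
        self._data_ref = weakref.ref(data)

    @property
    def data(self):
        obj = self._data_ref()
        if obj is None:
            raise ReferenceError("the original data no longer exists")
        return obj


def freed_by_refcount_alone(accessor_cls):
    """Return True if the owner is freed as soon as its last name is deleted."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector out of the experiment
    try:
        data = Data([1, 2, 3])
        data.accessor = accessor_cls(data)
        probe = weakref.ref(data)  # lets us observe whether `data` is alive
        del data
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle left behind


if __name__ == "__main__":
    print("strong:", freed_by_refcount_alone(StrongAccessor))
    print("weak:  ", freed_by_refcount_alone(WeakAccessor))
```

With the strong reference, the owner stays alive until a cyclic collection runs. With the weak reference, it is released as soon as the last user reference goes away, at the cost of the accessor having to handle the case where the data no longer exists.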

To convince myself this was happening, I used the following script, taken from this blog post.

import gc
import pandas as pd
import woodwork as ww  # noqa: F401 (importing woodwork registers the .ww accessor)


def dump_garbage():
    gc.collect()

    print("\nUNCOLLECTABLE OBJECTS:")
    for x in gc.garbage:
        s = str(x)
        if len(s) > 80:
            s = s[:77]+'...'
        print(type(x), "\n  ", s)


def non_leaky_list():
    l = []
    l.append(1)
    del l
    dump_garbage()


def leaky_list():
    l = []
    l.append(l)
    del l
    dump_garbage()


def make_dataframe():
    df = pd.DataFrame({"a": [1, 2, 3],
                       "b": [4, 5, 6]})
    del df
    dump_garbage()


def make_ww_dataframe():
    df = pd.DataFrame({"a": [1, 2, 3],
                       "b": [4, 5, 6]})
    df.ww.init()
    del df
    dump_garbage()


if __name__ == "__main__":
    gc.enable()
    gc.set_debug(gc.DEBUG_SAVEALL)
    print("Non Leaky List")
    non_leaky_list()
    print("Leaky list")
    leaky_list()
    print("Make a dataframe")
    make_dataframe()
    print("Make a WW dataframe")
    make_ww_dataframe()

The leaky_list and non_leaky_list functions were added to sanity check that only leaky objects appear in gc.garbage.
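As background for reading the output: according to the Python `gc` documentation, setting `gc.DEBUG_SAVEALL` makes the collector append every unreachable object it finds to `gc.garbage` instead of freeing it. An object listed there was therefore part of a reference cycle that reference counting alone could not release. The sketch below isolates that behaviour. `count_saved`, `acyclic` and `cyclic` are hypothetical helpers written for this illustration and are not part of the script above.

```python
import gc


def count_saved(make_objects):
    """Run make_objects() and return how many objects the collector saved."""
    old_flags = gc.get_debug()
    gc.collect()          # flush unrelated cycles before measuring
    gc.garbage.clear()
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        make_objects()
        gc.collect()      # unreachable objects land in gc.garbage
        return len(gc.garbage)
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()


def acyclic():
    items = []
    items.append(1)       # no cycle: freed by reference counting on return


def cyclic():
    items = []
    items.append(items)   # the list refers to itself: only the collector finds it


if __name__ == "__main__":
    print("acyclic:", count_saved(acyclic))
    print("cyclic: ", count_saved(cyclic))
```

This is the same distinction the two list functions in the script are meant to check.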

The output should look like this, and we can see that the pandas dataframe with woodwork is now "uncollectable":

Non Leaky List

UNCOLLECTABLE OBJECTS:
Leaky list

UNCOLLECTABLE OBJECTS:
<class 'list'>
   [[...]]
Make a dataframe

UNCOLLECTABLE OBJECTS:
<class 'list'>
   [[...]]
Make a WW dataframe

UNCOLLECTABLE OBJECTS:
<class 'list'>
   [[...]]
<class 'pandas.core.frame.DataFrame'>
      a  b
0  1  4
1  2  5
2  3  6
<class 'pandas.core.indexes.base.Index'>
   Index(['a', 'b'], dtype='object')
<class 'dict'>
   {'_data': array(['a', 'b'], dtype=object), '_index_data': array(['a', 'b'], d...
<class 'pandas.core.indexes.range.RangeIndex'>
   RangeIndex(start=0, stop=3, step=1)
<class 'dict'>
   {'_range': range(0, 3), '_name': None, '_cache': {}, '_id': <object object at...
<class 'pandas.core.internals.blocks.IntBlock'>
   IntBlock: slice(0, 2, 1), 2 x 3, dtype: int64
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 2, 1))
<class 'slice'>
   slice(0, 2, 1)
<class 'pandas.core.internals.managers.BlockManager'>
   BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(star...
<class 'list'>
   [Index(['a', 'b'], dtype='object'), RangeIndex(start=0, stop=3, step=1)]
<class 'tuple'>
   (IntBlock: slice(0, 2, 1), 2 x 3, dtype: int64,)
<class 'dict'>
   {'_is_copy': None, '_mgr': BlockManager
Items: Index(['a', 'b'], dtype='objec...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb968458db0; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb968458db0; dead>}
<class 'woodwork.table_accessor.PandasTableAccessor'>
          Physical Type Logical Type Semantic Tag(s)
Column                     ...
<class 'dict'>
   {'_dataframe':    a  b
0  1  4
1  2  5
2  3  6, '_schema':        Logical Typ...
<class 'cell'>
   <cell at 0x7fb968490e20: numpy.ndarray object at 0x7fb968466930>
<class 'tuple'>
   (<cell at 0x7fb968490e20: numpy.ndarray object at 0x7fb968466930>,)
<class 'function'>
   <function Index._engine.<locals>.<lambda> at 0x7fb9703ae9d0>
<class 'pandas._libs.index.ObjectEngine'>
   <pandas._libs.index.ObjectEngine object at 0x7fb8f00ac090>
<class 'dict'>
   {'_engine': <pandas._libs.index.ObjectEngine object at 0x7fb8f00ac090>, 'is_u...
<class 'pandas.core.internals.blocks.IntBlock'>
   IntBlock: 3 dtype: int64
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 3, 1))
<class 'slice'>
   slice(0, 3, 1)
<class 'pandas.core.internals.managers.SingleBlockManager'>
   SingleBlockManager
Items: RangeIndex(start=0, stop=3, step=1)
IntBlock: 3 dty...
<class 'list'>
   [RangeIndex(start=0, stop=3, step=1)]
<class 'tuple'>
   (IntBlock: 3 dtype: int64,)
<class 'pandas.core.series.Series'>
   0    1
1    2
2    3
Name: a, dtype: int64
<class 'dict'>
   {'_is_copy': None, '_mgr': SingleBlockManager
Items: RangeIndex(start=0, stop...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb9703aa680; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb9703aa680; dead>}
<class 'dict'>
   {'a': 0    1
1    2
2    3
Name: a, dtype: int64, 'b': 0    4
1    5
2    6
N...
<class 'tuple'>
   ('a', <weakref at 0x7fb968458db0; dead>)
<class 'cell'>
   <cell at 0x7fb9804f9df0: function object at 0x7fb9703ae940>
<class 'cell'>
   <cell at 0x7fb9804f9dc0: TypeSystem object at 0x7fb99030d6d0>
<class 'list'>
   [IntegerNullable, Integer]
<class 'tuple'>
   ([IntegerNullable, Integer],)
<class 'tuple'>
   (<cell at 0x7fb9804f9df0: function object at 0x7fb9703ae940>, <cell at 0x7fb9...
<class 'function'>
   <function TypeSystem.infer_logical_type.<locals>.get_inference_matches at 0x7...
<class 'pandas.core.internals.blocks.IntBlock'>
   IntBlock: 3 dtype: int64
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 3, 1))
<class 'slice'>
   slice(0, 3, 1)
<class 'pandas.core.internals.managers.SingleBlockManager'>
   SingleBlockManager
Items: RangeIndex(start=0, stop=3, step=1)
IntBlock: 3 dty...
<class 'list'>
   [RangeIndex(start=0, stop=3, step=1)]
<class 'tuple'>
   (IntBlock: 3 dtype: int64,)
<class 'pandas.core.series.Series'>
   0    4
1    5
2    6
Name: b, dtype: int64
<class 'dict'>
   {'_is_copy': None, '_mgr': SingleBlockManager
Items: RangeIndex(start=0, stop...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb98054a810; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb98054a810; dead>}
<class 'tuple'>
   ('b', <weakref at 0x7fb968458db0; dead>)
<class 'cell'>
   <cell at 0x7fb9804f97f0: function object at 0x7fb9703ae790>
<class 'cell'>
   <cell at 0x7fb9804f9820: TypeSystem object at 0x7fb99030d6d0>
<class 'list'>
   [IntegerNullable, Integer]
<class 'tuple'>
   ([IntegerNullable, Integer],)
<class 'tuple'>
   (<cell at 0x7fb9804f97f0: function object at 0x7fb9703ae790>, <cell at 0x7fb9...
<class 'function'>
   <function TypeSystem.infer_logical_type.<locals>.get_inference_matches at 0x7...
<class 'woodwork.table_schema.TableSchema'>
          Logical Type Semantic Tag(s)
Column
a    ...
<class 'woodwork.column_schema.ColumnSchema'>
   <ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
<class 'dict'>
   {'metadata': {}, 'description': None, 'logical_type': Integer, 'use_standard_...
<class 'set'>
   {'numeric'}
<class 'dict'>
   {'a': <ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>, ...
<class 'woodwork.column_schema.ColumnSchema'>
   <ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
<class 'dict'>
   {'metadata': {}, 'description': None, 'logical_type': Integer, 'use_standard_...
<class 'set'>
   {'numeric'}
<class 'dict'>
   {'name': None, 'columns': {'a': <ColumnSchema (Logical Type = Integer) (Seman...
<class 'pandas.core.series.Series'>
   0       a  b
1    0  1  4
2    1  2  5
3    2  3  6
dtype: object
<class 'pandas.core.indexes.range.RangeIndex'>
   RangeIndex(start=0, stop=4, step=1)
<class 'dict'>
   {'_range': range(0, 4), '_name': None, '_cache': {}, '_id': <object object at...
<class 'pandas.core.internals.blocks.ObjectBlock'>
   ObjectBlock: 4 dtype: object
<class 'pandas._libs.internals.BlockPlacement'>
   BlockPlacement(slice(0, 4, 1))
<class 'slice'>
   slice(0, 4, 1)
<class 'pandas.core.internals.managers.SingleBlockManager'>
   SingleBlockManager
Items: RangeIndex(start=0, stop=4, step=1)
ObjectBlock: 4 ...
<class 'list'>
   [RangeIndex(start=0, stop=4, step=1)]
<class 'tuple'>
   (ObjectBlock: 4 dtype: object,)
<class 'dict'>
   {'_is_copy': None, '_mgr': SingleBlockManager
Items: RangeIndex(start=0, stop...
<class 'pandas.core.flags.Flags'>
   <Flags(allows_duplicate_labels=True)>
<class 'weakref'>
   <weakref at 0x7fb9607e4270; dead>
<class 'dict'>
   {'_allows_duplicate_labels': True, '_obj': <weakref at 0x7fb9607e4270; dead>}
<class 'pandas.core.strings.accessor.StringMethods'>
   <pandas.core.strings.accessor.StringMethods object at 0x7fb9601e38e0>
<class 'dict'>
   {'_inferred_dtype': 'string', '_is_categorical': False, '_is_string': False, ...


freddyaboulton added the new feature suggestions for new functionality label May 3, 2021

freddyaboulton mentioned this issue May 3, 2021

POC: Use a weak reference to the original data #881

Closed

gsheni added evalml EvalML request needs design Issues requiring design documentation. labels May 3, 2021

thehomebrewnerd self-assigned this May 5, 2021

This was referenced May 5, 2021

Update accessors to store weak reference to data #894

Merged

BUG: Memory leak when using custom accessor pandas-dev/pandas#41357

Open

thehomebrewnerd closed this as completed in #894 May 12, 2021
